1. INTRODUCTION

Sexually Transmitted Infections (STI), also called Sexually Transmitted Diseases (STD), spread through
sexual contact with infected partners, and people who frequently change partners are at particular risk [1].
More than 30 different bacteria, viruses and parasites can be transmitted through sexual intercourse. Eight of
these infections are the most frequent. Four of them (50%) are currently curable: syphilis, gonorrhoea,
chlamydia and trichomoniasis. The other four (50%) cannot be cured: hepatitis B, HSV (Herpes Simplex
Virus) or herpes, HIV, and HPV (Human Papillomavirus) [2].

In Indonesia, particularly in Malang city, the number of people with HIV/AIDS in 2014 was 466,
AIDS cases reached 225, and syphilis cases 14 [3]. According to a Malang city health report, patients were
typically aged 25-49 years [3]. Some STDs are asymptomatic [2], meaning that the patient is not aware of
any symptoms, so the disease may go undetected until the patient takes a medical test. Malang city grows
every year in social, demographic and population-migration terms [4]. Besides being known as a tourist
destination, Malang is also known as a city of education, so its population increases every year [3], [5],
which raises the risk of STDs spreading. Treating an STD is costly [1]. Therefore, it is important to detect
and treat STDs early in order to reduce the financial burden on patients.

With the technological developments of the modern era, STD detection can make use of information
technology. One way is to build a system that helps patients detect STDs early and be treated individually [6].
Several previous studies are relevant. Lakshmi and Isakki [7] compared several data mining methods,
namely Decision Tree, Support Vector Machine (SVM) and Naive Bayes, for predicting HIV/AIDS. Their
results showed that Decision Tree achieved the highest accuracy at 90.0741%, followed by SVM at 85.05%,
with Naive Bayes lowest at 77.5%. Dangare and Apte [6] studied heart disease prediction systems by
comparing three methods: Neural Network, Decision Tree and Naive Bayes. They concluded that the Neural
Network method reached 100% accuracy in detecting heart disease, meaning that the computed results
showed no discrepancy with the original data. Another study predicted anti-retroviral drug consumption
from past pharmacy consumption data at Jugal hospital [8] and found that the M5P tree model gave the best
result. Kaur and Bawa [9] examined suitable methods for predicting various diseases and reported that the
Decision Tree method obtained an accuracy of 95%. Further relevant studies are discussed in more detail in
the literature study.

This issue is important to address because these diseases are very dangerous and can be fatal if not
treated as soon as possible. Therefore, this research classifies STDs based on existing symptoms, with the
future goal that a disease can be detected early when new symptoms appear. The authors focus on testing the
accuracy of three data mining methods, namely Naive Bayes, K-Means and K-Nearest Neighbor (K-NN), for
STD classification. In addition to comparing the methods, the research also aims at early detection of STDs.
The final conclusion therefore identifies the best of the three methods for classifying STDs.


2. LITERATURE STUDY 

Data mining is also called knowledge discovery in databases [10]. Disease classification of this kind
can be treated as a data mining problem because it requires existing knowledge before a case can be
classified. Many data mining methods can be used, but in this research the authors focus on comparing three
methods, namely Naive Bayes, K-Means and K-NN.

The Naive Bayes method has been applied to classification problems in previous studies. Patil
et al. [11] applied the Naive Bayes and J4.8 Decision Tree methods to classify data, compared the
performance of the two methods and concluded from the experimental results that the Naive Bayes method
is more efficient. Durgalakshmi et al. [12] implemented an improved version of the Naive Bayes method for
classifying breast cancer. The improvement lies in its performance and accuracy calculations. Griffis
et al. [13] adopted an automated approach to identify stroke by using the Naive Bayes method, while
Lakoumentas et al. [14] optimized the Naive Bayes method for the classification of B-Chronic Lymphocytic
Leukemia (B-CLL). Their variant differs from conventional Naive Bayes in how attributes are handled when
classifying discrete values and in the optimization of its accuracy.

In addition to Naive Bayes, classification can also be performed with the K-Means method.
Cimen et al. [15] applied the K-Means method to classify arrhythmia based on a polyhedral conic functions
algorithm. The performance was demonstrated through numerical experiments, with an accuracy of 98%.
Khanmohammadi [16] improved the K-Means method for medical applications by using an overlapping
technique derived from conventional K-Means. The resulting Overlapping K-Means (OKM) is considered
efficient in classifying data for medical applications. Anand [17] detected plant diseases in brinjal leaves
through image processing techniques, using the K-Means method for segmentation and a Neural Network
for classification. Guftar [18] proposed a new framework for classifying the symptoms of syncope. Using
the K-Means method, it predicts the major causes of syncope, and its results were compared with other
methods such as fast K-Means, K-Medoids and X-Means. Santhanam et al. [19] combined three methods,
namely K-Means, Genetic Algorithm and Support Vector Machine (SVM), to diagnose diabetes. K-Means is
used to eliminate noisy data, the Genetic Algorithm to find optimal features, and SVM for classification.
The experiment obtained an accuracy of 96.71%.

Udovychenko et al. [20] classified heart failure by using binary K-NN. The experiments
obtained 80-88% accuracy, 70-95% sensitivity, 78-95% specificity and 77-93% precision. In another
study, Udovychenko et al. [21] classified ischemic heartbeats using the K-Means method. Based on the
results of the experiment, the optimal number of neighbors for increasing accuracy was 20-25.
Saha et al. [22] addressed gene selection using K-NN together with heuristic methods. K-NN is used to
classify the samples, while the heuristic methods chosen are Simulated Annealing (SA) and Particle Swarm
Optimization (PSO). Based on the results of the experiment, the researchers claim that SA is better than
PSO.

Each of the methods applied in these previous studies has its own advantages. The authors therefore
analyse the accuracy of three data mining methods in classifying sexually transmitted diseases. The
three methods are Naive Bayes, K-Means and K-NN.


3. SEXUALLY TRANSMITTED DISEASE

A Sexually Transmitted Disease (STD) is a disease that is transmitted from one person to another
through sexual contact. STDs cause reproductive tract infections which, if not treated immediately, can lead
to prolonged illness, infertility and death. Symptoms of STDs include discharge or pus from the penis,
vagina or anus; pain or a burning sensation during urination; lumps, nodules or sores on the penis, vagina,
anus or mouth; swelling in the thighs; bleeding after sex; pain in the lower abdomen (in women); and pain
in the testicles. Some types of STD are listed in Table 1.


Table 1. Types of Sexually Transmitted Disease

| Disease | Symptoms | Causative agent |
|---|---|---|
| Gonorrhoea | a. Urge to urinate; b. Pain when urinating; c. White discharge from the vagina, and the infection can spread to the cervix, uterus, fallopian tubes, ovaries and urethra (lower urinary tract); d. Pain in the hip or pain during sexual intercourse | Neisseria gonorrhoeae |
| Syphilis | a. Symptoms appear after 3-4 weeks, sometimes up to 13 weeks, as a lump around the genitals; b. Accompanied by dizziness; c. Bone pain that goes away without treatment; d. Reddish spots on the body about 6-12 weeks after intercourse | Treponema pallidum |
| Herpes Genitalis | Watery blisters (clustered like grapes) appear around the genitals for 1-3 weeks, then rupture and leave dry sores, then disappear; the symptoms recur as above but are not as painful as in the early stage | Herpes Simplex Virus type 2 (HSV-2) |
| Chlamydia | a. Inflammation of the male and female reproductive organs; b. Whitish or yellowish-white discharge from the genitals; c. Pain in the pelvic cavity and bleeding after sex | Chlamydia |
| Trichomoniasis Vaginalis | a. Thin, yellowish, foamy and foul-smelling vaginal discharge; b. Vulva slightly swollen, red, itchy and uncomfortable; c. Pain during intercourse or while urinating | Trichomonas vaginalis |
| Genital Warts | a. In infected women, affects the skin of the genital area up to the anus and the mucous membrane inside the genitals up to the cervix; b. In infected pregnant women the warts can grow large, and genital warts can sometimes lead to cervical cancer or skin cancer around the genitals; c. In infected men, affects the genitals and urinary tract | Human Papillomavirus |
| Chancroid | a. Festering or acutely ulcerating, painful sores on the genitals, less than 1 cm in diameter; b. Painful swelling of the glands | Haemophilus ducreyi |
| Lymphogranuloma Venereum | A small painless sore in the genital area, followed by painful swelling | Chlamydia trachomatis |
| Granuloma Inguinale | A small sore on the skin of the genitals that spreads to form a granulomatous mass (small bumps) that can cause severe damage to the pubic organs | Donovania granulomatis |
| Cervicitis | Frequent urination; pain during urination; pain during intercourse; abnormal vaginal bleeding; vaginal discharge | Bacterial infections |
| Vaginal Candidiasis | Itching and irritation of the vagina and vulva (skin folds outside the vagina), accompanied by a white, thick, cheese-like vaginal discharge | Candida albicans |
| Bacterial Vaginosis | Thin white or gray discharge; vaginal odour | Polymicrobial |
| Molluscum Contagiosum | Papules (smooth bumps) that are painless and can disappear without treatment | Viral infection |
| Proctitis | Pain in the rectum | - |
| Neonatal Conjunctivitis | An infection of the conjunctiva (the white part of the eye) and the membrane lining the eyelids | Streptococcus pneumoniae, Haemophilus influenzae, Neisseria gonorrhoeae and Herpes Simplex Virus |
| Pelvic Inflammation | Inflammation or infection of organs in the female pelvis; the pelvic organs include the uterus, fallopian tubes (oviducts), ovaries and cervix | - |

Source: [1]


4. RESEARCH METHOD

This research uses 139 sample data from a hospital in Malang city. The data consist of 109
training data and 30 testing data. The 109 training data cover 16 classes of disease, and each record is
described by 29 symptoms. The 16 diseases are Gonorrhoea, Syphilis, Herpes Genitalis, Chlamydia,
Trichomoniasis Vaginalis, Genital Warts, Chancroid, Lymphogranuloma Venereum, Granuloma Inguinale,
Cervicitis, Vaginal Candidiasis, Bacterial Vaginosis, Molluscum Contagiosum, Neonatal Conjunctivitis,
Proctitis and Pelvic Inflammation. The authors applied three data mining methods for STD
classification: Naive Bayes, K-Means and K-NN. Each method is discussed below.


4.1. Naive bayes 

Naive Bayes is a simple probabilistic classification method that computes a set of probabilities
by counting the frequencies and value combinations in the given dataset [23]. The method assumes that all
attributes are independent of one another given the value of the class variable [11]. The advantages
of Naive Bayes are that it is easy to construct, does not require complicated parameter estimation schemes,
is easy to apply to large data sets, and produces classification results that are easily interpreted by a
layman [24]. The Naive Bayes equation [25] is shown in Equation (1).


$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \qquad (1)$$

Where,

X : Data with unknown class
H : The hypothesis that the data belong to a specific class
P(H | X) : The probability of hypothesis H given condition X (posterior probability)
P(H) : The probability of hypothesis H (prior probability)
P(X | H) : The probability of X given the conditions in hypothesis H
P(X) : The probability of X
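
As an illustration of Equation (1), the following minimal Python sketch estimates P(H) and P(X | H) from binary symptom vectors and returns the posterior for each class. It is not the authors' implementation: the function names, the toy records and the Laplace smoothing term `alpha` are assumptions introduced only for this example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(records, labels, alpha=1.0):
    """Estimate P(H) and P(x_j = 1 | H) from binary symptom vectors."""
    n_features = len(records[0])
    class_counts = Counter(labels)
    ones = defaultdict(lambda: [0] * n_features)
    for x, h in zip(records, labels):
        for j, value in enumerate(x):
            ones[h][j] += value
    priors = {h: c / len(labels) for h, c in class_counts.items()}
    # Laplace smoothing (alpha) is an added assumption that avoids zero probabilities.
    likelihoods = {
        h: [(ones[h][j] + alpha) / (class_counts[h] + 2 * alpha)
            for j in range(n_features)]
        for h in class_counts
    }
    return priors, likelihoods


def posterior(x, priors, likelihoods):
    """Return P(H | X) for every class H, following Equation (1)."""
    joint = {}
    for h, prior in priors.items():
        p = prior
        for value, p_one in zip(x, likelihoods[h]):
            p *= p_one if value == 1 else (1.0 - p_one)
        joint[h] = p  # P(X | H) * P(H) under the independence assumption
    evidence = sum(joint.values())  # P(X)
    return {h: p / evidence for h, p in joint.items()}


# Hypothetical toy data: 3 binary symptoms, 2 classes.
records = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
labels = ["A", "A", "B", "B"]
priors, likelihoods = train_naive_bayes(records, labels)
post = posterior([1, 1, 0], priors, likelihoods)
predicted = max(post, key=post.get)
```

The predicted class is the one with the largest posterior. Because P(X) is the same for every class, it only rescales the scores and does not change which class is chosen.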


4.2. K-means 

K-Means is one of the simplest unsupervised learning algorithms [26]. It uses a centroid model, in
which the centroid is the midpoint of a cluster and is usually represented as a value [25]. The centroid is
used as the reference point for calculating the distance of each data object [24]. The K-Means steps
are [27]:
a. Initialize by determining the value of K, the number of clusters. If necessary, specify the threshold for
the change of the objective function (the limit that stops the iteration) and the threshold for the change
of centroid position.
b. Choose K data points from the data set X as the initial centroids.
c. Calculate the distance between each object and each centroid, as shown in Equation (2).


$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad (2)$$

Where,
d(a, b) : Distance between object a and object b
n : Dimension of the data
a_i : Coordinate of object a in dimension i
b_i : Coordinate of object b in dimension i


d. Assign each object to the cluster whose centroid is at the minimum distance.
e. Repeat steps c and d until convergence, that is, until the change of the objective function is below the
threshold, no data move between clusters, or the change of centroid position is below the threshold.
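
The steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiment: the function names, the `seed`, `tol` and `max_iter` parameters and the toy data are assumptions, and the centroid update (the mean of each cluster's members) is the standard K-Means update that steps c to e imply.

```python
import math
import random


def euclidean(a, b):
    """Equation (2): Euclidean distance between objects a and b."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_means(data, k, max_iter=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    # Step b: pick K data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(data, k)]
    assignments = [None] * len(data)
    for _ in range(max_iter):
        # Steps c and d: assign each object to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in data
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(data, new_assignments) if a == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        shift = max(euclidean(old, new) for old, new in zip(centroids, new_centroids))
        unchanged = new_assignments == assignments
        centroids, assignments = new_centroids, new_assignments
        # Step e: stop when no object changes cluster or the centroids barely move.
        if unchanged or shift < tol:
            break
    return centroids, assignments


# Hypothetical toy data with two well-separated groups.
data = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
centroids, assignments = k_means(data, k=2)
```

Note that the cluster indices returned are arbitrary labels. Using K-Means for classification therefore needs an extra step that maps each cluster to a disease class, which is not specified in the steps above.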


4.3. K-nearest neighbor 

K-NN is a non-parametric classification method [28]. Computationally, it is simpler than other
methods. K-NN works by calculating the proximity between a new case and an old case based on matching
the weights of a number of existing features [25]. A new case is thus identified by its similarity to
previous cases. The similarity between a new case and an old case is calculated with
Equation (3).


Int J Elec & Comp Eng, Vol. 8, No. 5, October 2018 : 3933 — 3939 


Int J Elec & Comp Eng, ISSN: 2088-8708


$$\mathrm{similarity}(T, S) = \frac{\sum_{i=1}^{n} f(T_i, S_i) \times w_i}{\sum_{i=1}^{n} w_i} \qquad (3)$$

Where,
T : new case
S : existing case in storage
n : number of attributes in each case
i : an individual attribute between 1 and n
f : similarity function for attribute i between case T and case S
w : the weight assigned to attribute i

$$w_i = \frac{1}{d(x', x_i)^2} \qquad (4)$$
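
A small sketch of the weighted similarity in Equation (3) is given below. It rests on two assumptions that the text does not state: the attribute similarity f returns 1 when two attribute values match and 0 otherwise, and the score is normalized by the total weight. The function name and the example values are hypothetical.

```python
def weighted_similarity(new_case, stored_case, weights):
    """Weighted similarity between a new case T and a stored case S.

    Assumes f(T_i, S_i) = 1 for matching attributes and 0 otherwise,
    and normalizes by the sum of the weights.
    """
    matched = sum(w for t, s, w in zip(new_case, stored_case, weights) if t == s)
    return matched / sum(weights)


# Hypothetical cases with four binary symptoms and per-attribute weights.
new_case = [1, 0, 1, 1]
stored_case = [1, 1, 1, 0]
weights = [1.0, 0.5, 1.0, 0.5]
score = weighted_similarity(new_case, stored_case, weights)
```

Under these assumptions, two identical cases score 1 and two cases that differ in every attribute score 0.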


Proximity usually takes a value between 0 and 1. A value of 0 means that the two cases are
completely dissimilar, whereas a value of 1 means that the two cases are identical. The K-NN steps are:
a. Determine the parameter K (the number of nearest neighbors).
b. Calculate the squared Euclidean distance between each object and the sample data.
c. Sort the objects by Euclidean distance, smallest first. The Euclidean distance formula is shown in
Equation (5).


$$D(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} \qquad (5)$$


d. Collect the categories Y of the K nearest neighbors (K-NN classification).
e. Predict the class of the query as the majority category among those neighbors.
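
The K-NN steps above can be illustrated with the short Python sketch below. It is a generic version under stated assumptions, not the authors' code: the function names, the choice K = 3 and the toy symptom vectors are hypothetical. It ranks neighbors by the Euclidean distance of Equation (5), which gives the same ordering as the squared distance.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Equation (5): Euclidean distance between vectors x and y."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


def knn_predict(query, train_x, train_y, k=3):
    # Steps b and c: compute every distance and sort, smallest first.
    distances = sorted(
        (euclidean(query, x), label) for x, label in zip(train_x, train_y)
    )
    # Step d: collect the categories of the K nearest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Step e: predict the majority category.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: four binary symptoms, two classes.
train_x = [
    [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1],
]
train_y = ["A", "A", "A", "B", "B", "B"]

pred_a = knn_predict([1, 1, 0, 0], train_x, train_y, k=3)
pred_b = knn_predict([0, 0, 1, 1], train_x, train_y, k=3)
```

With binary symptom vectors many distances tie, so the tie-breaking rule (here, the sort order of the labels) can influence the result for small K.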


5. EXPERIMENTAL SETUP AND RESULTS 

The experiment uses 30 testing data, whose predicted classes are compared with the actual data to
determine the accuracy of each method. Accuracy is measured as a percentage, as shown in
Equation (6).


$$\text{Accuracy} = \frac{\sum \text{correct data}}{\sum \text{testing data}} \times 100\% \qquad (6)$$


The results of STD classification using the three algorithms are shown in Table 2.


Table 2. Results of STD Classification

| Method | Correct data | Incorrect data | Accuracy |
|---|---|---|---|
| Naïve Bayes | 23 | 7 | 76.67% |
| K-Means | 3 | 27 | 10% |
| K-NN | 27 | 3 | 90% |
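
As a quick check of Equation (6), the short sketch below recomputes the accuracies from the number of correctly classified records out of the 30 testing data. The variable and function names are chosen only for this illustration.

```python
def accuracy(correct, total):
    """Equation (6): percentage of correctly classified testing data."""
    return 100.0 * correct / total


TOTAL_TEST = 30  # number of testing data used in the experiment
correct_counts = {"Naive Bayes": 23, "K-Means": 3, "K-NN": 27}

results = {m: round(accuracy(c, TOTAL_TEST), 2) for m, c in correct_counts.items()}
incorrect = {m: TOTAL_TEST - c for m, c in correct_counts.items()}
```

For each method the correct and incorrect counts sum to the 30 testing data, so the incorrect counts follow directly from the reported accuracies.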


In these experiments the K-Means method showed the worst results. One possible cause is the
random initialization of the K centroids: if the random starting points are poor, the classification result
becomes less optimal. Another possible cause is that each symptom parameter takes only two values, 0 and 1,
rather than fuzzy values, which vary more and could make the classification more optimal. A further factor
is the dimension of the data, which in this research is 29. This makes it difficult for K-Means to
determine an appropriate K value. K-Means is also by nature better suited to clustering problems than to
classification problems.

Better results were obtained with the Naïve Bayes method, which matched 76.67% of the testing data.
This can happen because Naïve Bayes has advantages in dealing with classification problems, especially with
quantitative data. With relatively little training data, the method is able to obtain good results. It is
also supported by the ability of Naïve Bayes to handle missing values and its robustness to less relevant
attributes.

The best result was obtained with the K-NN method. K-NN can produce a more accurate and effective
classification than the other methods when the training data are large enough, as in this
experiment. The method is also robust to noisy data. In applying the three methods, some data were
classified incorrectly. The misclassification occurs because different types of disease share some of the
same symptoms. An example is the classification of Scrotal Swelling and Bacterial Vaginosis. In the actual
data there are patients with similar symptoms who suffer from different types of disease. This makes the
classification accuracy less optimal.


6. CONCLUSION AND FUTURE WORK 

In this research, the authors conducted experiments using data mining methods for STD classification
on 139 data, 109 as training data and 30 as testing data. The best method for STD classification is K-NN,
with the highest accuracy of 90%. The accuracy results are quite good, but in future work the authors will
try to improve the classification accuracy by optimizing the parameters of the K-NN method, as was done
in our previous research on other classification problems [29], [30]. In addition, future research will
also vary the amount of training data and testing data in order to obtain accurate
classification results.